Parallel signatures of Mycobacterium tuberculosis and human Y-chromosome phylogeography support the Two Layer model of East Asian population history

The Two Layer hypothesis is fast becoming the favoured narrative describing East Asian population history. Under this model, hunter-gatherer groups who initially peopled East Asia via a route south of the Himalayas were assimilated by agriculturalist migrants who arrived via a northern route across Eurasia. A lack of ancient samples from tropical East Asia limits the resolution of this model. We consider insight afforded by patterns of variation within the human pathogen Mycobacterium tuberculosis (Mtb) by analysing its phylogeographic signatures jointly with the human Y-chromosome. We demonstrate the Y-chromosome lineages enriched in the traditionally hunter-gatherer groups associated with East Asia’s first layer of peopling to display deep roots, low long-term effective population size, and diversity patterns consistent with a southern entry route. These characteristics mirror those of the evolutionarily ancient Mtb lineage 1. The remaining East Asian Y-chromosome lineage is almost entirely absent from traditionally hunter-gatherer groups and displays spatial and temporal characteristics which are incompatible with a southern entry route, and which link it to the development of agriculture in modern-day China. These characteristics mirror those of the evolutionarily modern Mtb lineage 2. This model paves the way for novel host-pathogen coevolutionary research hypotheses in East Asia.


Supplementary Note 1. Y-Chromosome Haplogroup Frequency Analysis i) Dataset Compilation
Y-chromosome haplogroup frequency data was collated from previously published studies 1,2,3,4,5 .We ensured Y-chromosomes were genotyped to the current gold-standard in phylogenetic resolution 6 , which includes markers defining haplogroups L, T and NO* 6 .Only populations for which precise latitude and longitude coordinates could be retrieved from the study, or closely approximated using maps presented in the study, were used (Supplementary Figure 1).Supplementary Figure 1) Map showing locations of populations used for the spatial frequency analysis.

ii) K2b1 spatial frequency interpolation plot presented in Figure 3d
One of the haplogroups of interest to this study is K2b1.This haplogroup is defined by the markers P397 and P399 and constitutes a substantial proportion of Y-chromosome haplogroups in the populations of Oceania 7 .These two markers were discovered relatively recently and therefore weren't genotyped in all of the studies analysed here.
For the spatial frequency interpolation of lineage K2b1 (Figure 3d), only haplotypes which could be unambiguously assigned to haplogroups K2b1 were used, designated by the presence of downstream markers for haplogroups M and S 5 .This is a conservative choice and has minimal effect on the interpretation of this figure. 7demonstrate through a survey of over 7,000 K-M526 Y-chromosomes that the vast majority of Island Southeast Asian K-M526* haplogroups, which don't belong to haplogroups M or S, fall into the K2b1 clade.The spatial frequency estimates of the K2b1 haplogroup depicted in Figure 3d are therefore likely to be slightly underestimated in Island Southeast Asia.

Karafet et al. (2015)
Importantly, we find K2b1 to be highly unlikely to occur outside Island Southeast Asia.In our survey, there were 8 other populations from the sample of 298 populations which possessed K* haplogroups, most in the Middle East, and most with frequencies of below 1%.These lineages are likely to represent other K* haplotypes not belonging to K2b1, as Karafet et al. (2015) 7 show K2b1 to be completely absent from the region.

iii) Main Y-chromosome lineages in East Asian populations
We restricted the focus of our Y-chromosome analysis to the four predominant East Asian Ychromosome lineages.Our rationale for selecting these lineages is based on haplogroup frequencies from our survey, and data from prior studies which have assigned geographical provinces and origin points to haplogroups.To assess haplogroup frequencies we drew a geographical transect at longitude 90 and considered all populations to the east of this boundary (Supplementary Figure 2).This transect runs through the apex of the Bay of Bengal and approximates the division between South and East Asia.We calculated haplogroup frequencies as the mean value across all populations retained.Across all East Asian populations, we found haplogroup C to attain an average frequency of 11.6%, haplogroup D 4.8%, haplogroups K2b1/K* 4.5%, and haplogroup NO 72.7% (Supplementary Figure 3).All other Y-chromosome haplogroups had frequencies of less than 1%, aside from haplogroup R with a frequency of 2.5% (Supplementary Figure 3).As demonstrated through previous studies, R has a West Eurasian origin, and is present only on the fringes of East Asia, permeating the border of South Asia and Siberia 8,9 (Supplementary Figure 8).Supplementary Figure 2) Map showing geographic transect used to define East Asian populations for the analysis of Y-chromosome haplogroup frequencies.All populations to the east of the yellow line, drawn at longitude 90, were considered 'East Asian' for the purposes of this analysis.Supplementary Figure 3) Average haplogroup frequencies across all 'East Asian' populations.

iv) Haplogroup Frequency Enrichment in First Layer Populations
We summarised data presented in previous genetic studies to illustrate the enrichment of Ychromosome haplogroups C, D and K2b1 in the traditionally hunter-gatherer first layer populations of East Asia.We contrasted haplogroup frequencies in each of these populations with those of a nearby non-Indigenous group.As was the case for the Burmese 10 , the Japanese 11 , and the Philippines 12 , multiple nearby populations were pooled to maximise sample size, as detailed below.Some of the studies included in this survey were not genotyped to a high enough degree of resolution to unambiguously assign K* lineages to the K2b1 clade.To reflect this uncertainty these haplogroups were designated as K2b1/K*.The analysis of Karafet et al. (2015) 7 suggests that the vast majority of Oceania and Island Southeast Asian K* lineages do, however, fall into clade K2b1.Specific details of each chosen study, the precise genotyping scheme, and any additional notes are detailed below.

Papuan and Austronesian
Given the other main source of ancestry in present day Oceanic populations is derived from Austronesian migrants 15 , we chose a representative Taiwanese group to contrast Papuan haplogroup frequencies with.Taiwan is widely documented as the origin of the Austronesian expansion 16,17,18 .To represent these groups, we used a sample of 48 Taiwanese Ychromosomes from Karafet et al. ( 2010) 5 (Frequencies: C -0.021, D -0, K2b1 -0, NO -0.98,Other -0).Importantly, the conclusions of this analysis remain unchanged when alternate Oceanic reference populations are used, such as 'Micronesia', from Zhong et al. ( 2011) 1 .conducted included Mtb lineages 1 vs 2, and Y-chromosome haplogroups NO vs C, NO vs D and NO vs K2b1.In all comparisons, the growth rate of putative second layer lineages was significantly higher than that of putative first layer lineages (NO vs C, p<2.2x10 -16 ; NO vs D, p<2.2x10 -16 ; NO vs K2b1, p<2.2x10 -16 and L2 vs L1, p=0.038).Similar results were returned when considering growth rates from the Neolithic period (10Kya) until the present (NO vs C, p=2.98x10 -8 ; NO vs D, p=2.98x10 -8 ; NO vs K2b1, p=2.98x10 -8 and L2 vs L1, p=2.62x10 -6 ).

Ainu and Japanese
We also calculated a relative rate of increase for each lineage by dividing the lineage count in each interval by the count in the previous interval.Again, second layer lineages displayed higher growth rates during the period where trajectories overlapped (NO vs C, p<6.13x10 -10 ; NO vs D, p=9.55x10 -11 ; NO vs K2b1, p=5.70x10 -14 and L2 vs L1, p=5.91x10 -7 ).During the Neolithic period, all comparisons were significant (NO vs D, p=6.65x10 -3 ; NO vs K2b1, p=3.61x10 -3 and L2 vs L1, p=5.19x10 -5 ), aside from NO vs C (p=0.41).Supplementary Figure 23) Absolute growth rates of each Mtb and Y-chromosome lineage or haplogroup, based on the LTT trajectories generated using TRACER.Rates were calculated at regular intervals from timepoint zero to the coalescence of that lineage or haplogroup.Rates for putative first layer lineages are coloured in orange, and rates for putative second layer lineages in gray.
-Papuan Y-chromosome frequencies from a total of four populations described by Kayser et al. (2006) 14 were included (n=247).The majority of Papuan samples belonging to haplogroup K (n=183/200) could be unambiguously assigned to the K2b1 clade, due to the presence of markers belonging to either the M or S clades.It is likely that the remaining 17 K individuals also fall into lineage K2b1, as Karafet et al. (2015) 7 documented all 40 K(xNO) Papuan samples to belong to K2b1 sub-lineages in their analysis.
-To obtain haplogroup frequency estimates in Ainu populations, data from Tajima et al. (2004) 19 and Hammer et al. (2005) 20 was summarised.Together, this represents a summed sample size of 20 individuals (Frequencies: C -0.15, D -0.85, K2b1 -0, NO -0, Other -0).To obtain estimates of haplogroup frequencies in a cosmopolitan Japanese population, the Nonaka et al. (2007) 11 dataset was used.This dataset contains Y-chromosome haplogroups for 263 Japanese individuals from various provinces (Frequencies: C -0.053, D -0.39, K2b1 -0, NO -0.55,Other -0).Philippine hunter-gatherer and Philippine -Data from a pooled sample of 180 Indigenous Philippine samples described by Delfin et al. (2011) 21 was summarised.The enrichment signal of putative 'first-layer' Y-chromosome lineages is more pronounced if only groups with minimal degrees of non-Indigenous ancestry such as the Aeta and Agta are considered 21 .The study of Delfin et al. (2011) 21 does not genotype markers P397 and P399 which define lineage K2b1.To express this uncertainty, all K* lineages classified by Delfin et al. (2011) 21 were labelled as K2b1/K* for the purpose of Figure 1 (Main Text).It is likely however that the majority of these lineages fall into clade K2b1, as Karafet et al. (2015) 7 demonstrate with high resolution genotyping that K2b1 is present at a frequency of 0.60 in an Indigenous Philippine group.The results of this analysis hold rigorously if this dataset from Karafet et al. (2015) 7 is used, however the sample size (n=25) is significantly lower than that of Delfin et al. (2011) 21 (Frequencies: C -0.089, D -0, K2b1 -0.32, NO -0.6,Other -0).Haplogroup frequency estimates for a cosmopolitan Philippine population were obtained from Trejaut et al. (2014) 12 .This sample consisted of 146 Philippine Y-chromosome haplotypes from various populations of the Philippine archipelago (Frequencies: C -0, D -0, K2b1 -0.069, NO -0.85,Other -0.082).Malay hunter-gatherer and Malaysian -Data from Kutanan et al. (2019) 22 was summarised to provide Y-chromosome frequencies for Malay hunter-gatherer groups.These samples (n=4), which were all designated K* by Kutanan et al. (2019) 22 were labelled K2b1/K* for the purposes of Figure 2 (Main Text).These lineages are likely to belong to clade K2b1, as Karafet et al. (2015) 7 find 100% of Malay K(xNO/xR) lineages to belong to K2b1(Frequencies: C -0, D -0, K2b1 -1, NO -0, Other -0).For the Malaysian reference population the sample presented in Karafet et al. (2010) 5 was used.This dataset consists of 32 Malaysian individuals (Frequencies: C -0.031, D -0.031, K2b1 -0.094, NO -0.75,Other -0.094).